Improving Acoustic-Based Keyword Spotting Using LVCSR Lattices
This paper investigates the detection of English keywords in a conversational scenario using a combination of acoustic and LVCSR-based keyword spotting (KWS) systems. Acoustic KWS systems search for predefined words in parameterized spoken data. The corresponding confidences are represented by likelihood ratios given the keyword models and a background model. First, because of the especially high number of false alarms, the acoustic KWS system is augmented with confidence measures estimated from corresponding LVCSR lattices. Then, various strategies for combining scores estimated by the acoustic and several LVCSR-based KWS systems are explored. We show that a linear regression based combination significantly outperforms other (model-based) techniques: the number of false alarms of the combined KWS system decreased by more than 50% relative compared to the acoustic KWS system. Finally, attention is also paid to the complexity of the KWS systems, enabling them to potentially be exploited in real detection tasks.
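The linear regression based combination described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: per-system fusion weights (plus a bias) are fit by least squares on held-out candidate detections labeled as hits (1) or false alarms (0), and the fitted weights are then used to produce a single combined score. All data and names here are illustrative.

```python
import numpy as np

def fit_fusion_weights(scores, labels):
    """Least-squares fit of per-system weights plus a bias term.
    scores: (n_detections, n_systems) array of KWS scores.
    labels: 0/1 array marking false alarms vs. true hits."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return w

def fuse(scores, w):
    """Apply the learned linear combination to new detection scores."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return X @ w

# Toy example: two KWS systems, four candidate detections.
scores = np.array([[0.9, 0.8],   # hit
                   [0.2, 0.1],   # false alarm
                   [0.7, 0.9],   # hit
                   [0.3, 0.4]])  # false alarm
labels = np.array([1.0, 0.0, 1.0, 0.0])
w = fit_fusion_weights(scores, labels)
combined = fuse(scores, w)
```

In a real system the weights would be trained on a development set and the fused score thresholded to trade off misses against false alarms.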
Fast Approximate Spoken Term Detection from Sequence of Phonemes
We investigate the detection of spoken terms in conversational speech using phoneme recognition, with the objective of achieving a smaller index size as well as faster search speed. Speech is processed and indexed as the one-best phoneme sequence. We propose the use of a probabilistic pronunciation model for the search term to compensate for errors in phoneme recognition. This model is derived from the pronunciation of the word and the phoneme confusion matrix. Experiments are performed on the conversational telephone speech database distributed by NIST for the 2006 Spoken Term Detection evaluation. We achieve about 1500 times smaller index size and 14 times faster search speed compared to a state-of-the-art phoneme-lattice-based system, at the cost of somewhat lower detection performance.
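The idea of scoring a query against a recognized one-best phoneme sequence through a confusion matrix can be sketched as below. This is a simplified illustration, not the paper's exact model: it assumes per-phoneme independence and equal-length alignment (real systems would also model insertions and deletions), and the confusion probabilities are made-up toy values.

```python
import math

# Toy confusion probabilities P(observed phoneme | intended phoneme).
# Values and phoneme set are purely illustrative.
confusion = {
    ("k", "k"): 0.8, ("k", "g"): 0.2,
    ("ae", "ae"): 0.9, ("ae", "eh"): 0.1,
    ("t", "t"): 0.85, ("t", "d"): 0.15,
}

def match_score(intended, observed, floor=1e-6):
    """Log-probability that `observed` was recognized given the query's
    intended pronunciation, under a per-phoneme independence assumption.
    Unseen confusion pairs are floored rather than assigned zero."""
    assert len(intended) == len(observed)
    return sum(math.log(confusion.get((i, o), floor))
               for i, o in zip(intended, observed))

# Query "cat" /k ae t/: exact recognition vs. a k->g confusion.
score_exact = match_score(["k", "ae", "t"], ["k", "ae", "t"])
score_confused = match_score(["k", "ae", "t"], ["g", "ae", "t"])
```

The point of such a model is that the confused hypothesis still receives a usable (if lower) score instead of being missed outright, which is what compensates for one-best recognition errors.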
Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding
Voice communication between air traffic controllers (ATCos) and pilots is
critical for ensuring safe and efficient air traffic control (ATC). This task
requires high levels of awareness from ATCos and can be tedious and
error-prone. Recent attempts have been made to integrate artificial
intelligence (AI) into ATC in order to reduce the workload of ATCos. However,
the development of data-driven AI systems for ATC demands large-scale annotated
datasets, which are currently lacking in the field. This paper explores the
lessons learned from the ATCO2 project, a project that aimed to develop a
unique platform to collect and preprocess large amounts of ATC data from
airspace in real time. Audio and surveillance data were collected from publicly
accessible radio frequency channels with VHF receivers owned by a community of
volunteers and later uploaded to Opensky Network servers, which can be
considered an "unlimited source" of data. In addition, this paper reviews
previous work from ATCO2 partners, including (i) robust automatic speech
recognition, (ii) natural language processing, (iii) English language
identification of ATC communications, and (iv) the integration of surveillance
data such as ADS-B. We believe that the pipeline developed during the ATCO2
project, along with the open-sourcing of its data, will encourage research in
the ATC field. A sample of the ATCO2 corpus is available on the following
website: https://www.atco2.org/data, while the full corpus can be purchased
through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. We
demonstrated that ATCO2 is an appropriate dataset to develop ASR engines when
little or no in-domain ATC data is available. For instance, with the
CNN-TDNNf Kaldi model, we reached WERs as low as 17.9% and 24.9% on public
ATC datasets, which is 6.6%/7.6% absolute better than an "out-of-domain" but
supervised CNN-TDNNf model.
Comment: Manuscript under review
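The ATCO2 results above are reported as word error rate (WER). For reference, WER is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the reference length; a minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(r)][len(h)] / len(r)

# One substitution in a four-word reference gives 25% WER.
example = wer("cleared to land runway", "cleared to land runaway")
```

The utterance here is an invented ATC-style example, not taken from the ATCO2 corpus.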
The Spoken Web Search Task
<p>In this paper, we describe the “Spoken Web Search” Task, which is being held as part of the 2013 MediaEval campaign. The purpose of this task is to perform audio search in multiple languages and acoustic conditions, with very few resources available for each individual language. This year, the data contain audio from nine different languages and are much larger than in previous years, mimicking realistic low-/zero-resource settings.</p>
Query-by-Example Spoken Term Detection on Multilingual Unconstrained Speech
<p>As part of the MediaEval 2013 benchmark evaluation campaign, the objective of the Spoken Web Search (SWS) task was to perform Query-by-Example Spoken Term Detection (QbE-STD) using audio queries in a low-resource setting. After two successful editions and continuously growing interest from the scientific community, a special effort was made in SWS 2013 to prepare a challenging database, including speech in 9 different languages with diverse environment and channel conditions. In this paper, we first describe the database and the performance metrics. Then, we briefly review the algorithmic approaches followed by participants and present and discuss the results obtained, which demonstrate the feasibility of the proposed task even under such challenging conditions (multiple languages and unconstrained acoustic conditions). Finally, we analyze the fusion of the top-performing systems, which achieved a 30% relative improvement over the best single system in the evaluation, proving that a variety of approaches can be effectively combined to bring complementary information to the search for queries.</p>
Query-by-Example Spoken Term Detection Evaluation on Low-Resource Languages
<p>As part of the MediaEval 2013 benchmark evaluation campaign, the objective of the Spoken Web Search (SWS) task was to perform Query-by-Example Spoken Term Detection (QbE-STD), using spoken queries to retrieve matching segments in a set of audio files. As in previous editions, the SWS 2013 evaluation focused on the development of technology specifically designed to perform speech search in a low-resource setting. In this paper, we first describe the main features of past SWS evaluations and then focus on the 2013 SWS task, in which a special effort was made to prepare a challenging database, including speech in 9 different languages with diverse environment and channel conditions. The main novelties of the submitted systems are reviewed and performance figures are then presented and discussed, demonstrating the feasibility of the proposed task, even under such challenging conditions. Finally, the fusion of the 10 top-performing systems is analyzed. The best fusion provides a 30% relative improvement over the best single system in the evaluation, which proves that a variety of approaches can be effectively combined to bring complementary information to the search for queries.</p>
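A common baseline for the kind of multi-system fusion analyzed in these evaluations is to normalize each system's detection scores to a comparable scale and then average them. The sketch below uses z-normalization followed by mean fusion; this is a generic, hypothetical recipe for illustration, not the specific fusion method used in the SWS 2013 analysis.

```python
import numpy as np

def znorm(scores):
    """Zero-mean, unit-variance normalization of one system's scores."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-12)  # epsilon guards constant scores

def fuse(system_scores):
    """Average z-normalized scores across systems.
    system_scores: list of per-system score arrays over the same
    candidate detections (same length, same order)."""
    return np.mean([znorm(s) for s in system_scores], axis=0)

# Two toy systems scoring the same three candidate detections on
# very different scales; fusion still ranks candidate 1 highest.
fused = fuse([[2.0, 8.0, 5.0],
              [0.1, 0.9, 0.4]])
```

Normalizing before averaging matters because raw scores from heterogeneous QbE-STD systems (e.g. DTW costs vs. posterior-based scores) are not directly comparable.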